Search for: All records

Editors contains: "Banerjee, Arindam and"


  1. Banerjee, Arindam and (Ed.)
  2. Banerjee, Arindam and (Ed.)
    While reinforcement learning has recently witnessed tremendous success across a wide range of domains, robustness, or the lack thereof, remains an important and inadequately addressed issue. In this paper, we provide a distributionally robust formulation of offline policy learning in tabular RL that aims to learn, from historical data collected by some other behavior policy, a policy that is robust to a future environment arising as a perturbation of the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value of any given policy (i.e., its worst-case value over perturbed environments) and establish its finite-sample estimation error. Building on this, we then develop a novel, minimax-optimal distributionally robust learning algorithm that achieves $$O_P\left(1/\sqrt{n}\right)$$ regret, meaning that with high probability the policy learned from $$n$$ training data points will be $$O\left(1/\sqrt{n}\right)$$-close to the optimal distributionally robust policy. Finally, our simulation results demonstrate the superiority of our distributionally robust approach over non-robust RL algorithms.
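
For concreteness, here is a minimal, illustrative sketch of the kind of robust Bellman backup such a distributionally robust formulation relies on. It is not the paper's algorithm: it assumes an sa-rectangular, TV/L1-style uncertainty set of a hypothetical radius `radius` around an empirical transition model `P_hat` estimated from the offline data, and the function names (`worst_case_return`, `robust_value_iteration`) are placeholders chosen for this sketch.

```python
import numpy as np


def worst_case_return(p, V, radius):
    """Worst-case expected next-state value over an L1 ball of the given
    radius around the nominal distribution p (an sa-rectangular, TV-style
    uncertainty set; a common choice, not necessarily the paper's)."""
    q = p.copy()
    order = np.argsort(V)              # next states sorted by value, worst first
    budget = radius / 2.0              # probability mass the adversary may move
    for sp in order[::-1]:             # remove mass from the highest-value states
        take = min(q[sp], budget)
        q[sp] -= take
        budget -= take
        if budget <= 1e-12:
            break
    q[order[0]] += radius / 2.0 - budget   # dump the moved mass on the worst state
    return q @ V


def robust_value_iteration(P_hat, R, gamma=0.95, radius=0.2, iters=300):
    """Robust value iteration around the empirical model P_hat (S x A x S)
    estimated from offline data, with reward table R (S x A)."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[s, a] + gamma * worst_case_return(P_hat[s, a], V, radius)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)         # robust values and a greedy robust policy
```

The inner routine implements the standard worst case for an L1 ball: shift up to `radius / 2` probability mass away from the highest-value next states and onto the lowest-value one, which is what makes the resulting values pessimistic with respect to perturbations of the training environment.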